ABSTRACT

A method is proposed for the detection of item bias
with respect to observed or unobserved subgroups. The method uses
quasi-loglinear models for the incomplete subgroup x test score x
item 1 x ... x item k contingency table. If the subgroup membership
is unknown, the models are the incomplete-latent-class models of S.
J. Haberman (1979). The (conditional) Rasch model is formulated as a
quasi-loglinear model. The parameters in this model that correspond
to the main effects of the item responses are the conditional
estimates of the parameters in the Rasch model. Item bias can then be
tested by comparing the quasi-loglinear Rasch model with models that
contain parameters for the interaction of item responses and the
subgroups. An example uses data from 286 Dutch undergraduates who
took a multiplication test using Roman numerals
and numbers written out in Dutch. Some of the examinees had received
training in multiplying Roman numerals. It was expected that Roman
items would be biased, and the procedure confirmed this bias. Five
tables present the models and study data. A 55-item list of
references is included. (Author/SLD)
Abstract



A method is proposed for the detection of item bias with respect to
observed or unobserved subgroups. The method uses quasi-loglinear
models for the incomplete subgroup x test score x item 1 x ... x
item k contingency table. If subgroup membership is unknown, the
models are Haberman's incomplete-latent-class models.

The (conditional) Rasch model is formulated as a quasi-loglinear
model. The parameters in this loglinear model that correspond to
the main effects of the item responses are the conditional
estimates of the parameters in the Rasch model. Item bias can then
be tested by comparing the quasi-loglinear Rasch model with models
that contain parameters for the interaction of item responses and
the subgroups.
Introduction



Educational or psychological tests are biased if the test scores of
equally able test takers are systematically different between
racial, ethnic, cultural, or other subgroups. Biased test scores may
lead to unfair decisions or erroneous conclusions about individuals
from particular subgroups. A test score is biased only if one or
more of the test items are biased. A test item is biased if
individuals with the same ability level from different subgroups
have a different probability of a right response, i.e. the item has
different difficulties in different subgroups. A test can be made
fairer by deleting or improving the biased items.

Binet and Simon (1916; see Jensen, 1980, p. 367) were already
concerned with bias when they applied their test of general
intelligence, which was standardized on working class children, to
children of higher social status.

To assess bias, some unbiased criterion measure of ability is
needed. In some studies an external criterion for ability is at
hand (e.g. Petersen and Novick, 1976). In most practical
situations, however, no such external criterion is available and
some criterion for ability internal to the test itself is used.
Therefore most item bias detection techniques that are discussed in
the literature use an internal criterion in some manner. The way in
which this is done best distinguishes the methods from each other.
Reviews are given by Osterlind (1983); Rudner, Getson and Knight
(1980); and Shepard, Camilli and Averill (1981). Handbooks on item
bias detection methods and research are Berk (1982) and Jensen
(1980).

In the earlier item bias detection methods there is no explicit
control of ability. For instance, in the analysis of covariance
approach (Cardall & Coffman, 1965) transformed p-values are
analyzed in a subgroup x items design. If there is a significant
item by subgroup interaction, the item is considered biased. The
analysis of variance assumption of equal cell variances is met by
transforming the p-values by an arcsine transformation. Cleary and
Hilton (1968), Hoepfner and Strickland (1972) and Jensen (1973)
give further examples of this method.

The oldest and most popular item bias detection method is the
transformed item difficulty method (Thurstone, 1925; Angoff, 1982;
Angoff & Ford, 1973). It is conceptually very similar to the
analysis of variance method, because it also studies the item x
subgroup interaction of item difficulty. Angoff converted each
p-value to a normal deviate (called a delta) by an inverse normal
transformation. For all items, delta values are compared between two
subgroups by plotting the pairs of deltas in a bivariate graph.
Angoff claims that the delta pairs for each item scatter around a
straight line if the items are unbiased. If an item falls at some
distance from the line, this indicates an item x subgroup
interaction. The item is then considered biased. Examples of
practical application of this method are in Dorans (1982) and
Donlon, Hicks and Wallmark (1980).

Both the analysis of variance method and the transformed item
difficulty method analyse subgroup x items interactions in item
statistics on the subgroup level. Consequently, in these methods
control for ability must be performed through correcting for
differences between subgroup-level item statistics. This is
logically unsatisfactory because, according to the definition of
item bias, control for ability must be performed on the individual
level. It is also unsatisfactory in practice. Hunter (1975) and
Shepard, Camilli and Williams (1985) show that when the items vary
in difficulty and the distribution of ability is different in
different subgroups, items x subgroups interactions can arise in
perfectly unbiased tests.

A better way to control for ability is to use the raw score of
the remaining test items as an estimate of ability. Item bias
detection methods based on this idea, called chi-square methods,
are proposed by Scheuneman (1979) and Mellenbergh (1982).
Scheuneman uses data from an item response x subgroup x scoregroup
contingency table to test the hypothesis that within each
scoregroup the probabilities of a positive item response are the
same for all subgroups. If the hypothesis is rejected, the item is
considered biased. Baker (1981) criticized Scheuneman's method on
the grounds that the distribution of the test statistic is unknown
because Scheuneman used only the data from the positive responses.
Camilli (1979) and Nungester (1977) (see Ironson, 1982) proposed a
test statistic based on both the correct and the incorrect item
responses, which is asymptotically distributed as a chi-square.
Mellenbergh (1982) modified Scheuneman's method so that it fits in
the general theory of loglinear and logit models for contingency
tables. This yields a parametric model describing different types
of bias, which can also be tested by chi-square statistics.

Chi-square methods can detect item bias very well. Rudner,
Getson and Knight (1980) show that Scheuneman's method can detect
item bias in simulated data where the responses are generated from
a three-parameter logistic model with different slopes and locations
in different groups. Van der Flier, Mellenbergh, Ader and Wijn
(1984) show that Mellenbergh's (1982) method works well in both
empirical data and in simulated data generated by a certain
three-parameter normal-ogive type model. Kok, Mellenbergh and van der
Flier (1985) showed that the method also effectively detected
experimentally induced item bias. Although in chi-square methods
there is a better control for ability level than in the analysis of
variance method and the transformed item difficulty method, taking
ability as the number right score on the remaining items is rather
informal and possibly inappropriate.

In item-response theory, ability is described by formal
parameters. In these models the probability of an individual
response to a certain item is explained by parameters describing
the individual's ability and the item's difficulty. An item is
considered biased if the item parameters are different for
individuals with the same ability parameters from different
subgroups.

Lord (1980) used Birnbaum's three-parameter logistic model to
detect biased items. The model contains three parameters: a lower
asymptote, a slope, and a location parameter of the item
characteristic curve. The parameters are associated with guessing,
discrimination, and difficulty, respectively. If one or more of the
estimated item parameters differ significantly from subgroup to
subgroup, the item is considered biased.

Muthén and Lehman (1985) use multiple group factor analysis of
dichotomous variables to test the invariance of the parameters of
the two-parameter normal-ogive model over subgroups.

Durovic (1975) as well as Wright, Mead and Draba (1975) used the
Rasch model to detect item bias. For each item the mean squared
differences between the observed responses and expected
probabilities of a correct response were computed and compared
between two subgroups.

In this paper the loglinear formulation of the Rasch model
(Kelderman, 1984) is used to test the invariance of item parameters
over subgroups. If the difficulty parameters vary from subgroup to
subgroup, the item is considered biased. Subgroup membership may be
observed or unobserved. In some practical situations, items may be
expected to be biased for certain subgroups of individuals, but it
is not known a priori to which subgroup each of the individuals
belongs. For example, for an item in an examination the probability
of a correct response may be larger for a group of individuals with
specific educational experiences than for individuals without that
experience, or for an item in a mastery test the probability of a
correct response may be larger for a subgroup of individuals having
a different study strategy or for a subgroup of individuals having
a different cognitive strategy to solve the item, etc. In these
examples, information on the individuals' subgroup membership may
be difficult to observe or, as in the last example, the test
behavior itself may be the natural indicator of subgroup
membership. In this paper a loglinear Rasch model is formulated
where item difficulty may also vary over subgroups that are not
observed.

In what follows, the choice of the Rasch model to detect item
bias is discussed. Quasi-loglinear models are formulated for test
data and the Rasch model is formulated as one of them. Some
alternative models are described to test various aspects of item
bias with respect to known subgroups. The use of these tests is
illustrated on a set of test data from Kok (1982) where item bias
was introduced experimentally. Finally, corresponding latent class
models for item bias with respect to unknown subgroups are
described, and the effects of this bias are discussed.

Choice of model 

The Rasch model describes the probability $P(X_j = x_j \mid \alpha)$ that an
individual with parameter $\alpha$ gives a response $x_j$ to item j
(j = 1, ..., k), where $x_j$ can take the values 0 for a wrong response
or 1 for a right response:

(1)  $P(X_j = x_j \mid \alpha) = \exp\{x_j(\alpha - \delta_j)\} / \{1 + \exp(\alpha - \delta_j)\}$,

where $\delta_j$ (j = 1, ..., k) is a single item parameter describing the
difficulty of item j. If this item parameter varies from subgroup
to subgroup, the item is considered biased. Although the Rasch
model is a rather simple model, its parsimony yields several
virtues in using it to detect item bias.
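As a small illustration of equation (1), the following sketch evaluates the Rasch response probability in Python, assuming the usual form in which the probability of a right response is exp(alpha - delta) / (1 + exp(alpha - delta)). The function name and the two subgroup difficulties are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import math


def rasch_probability(x, alpha, delta):
    """P(X_j = x | alpha) under the Rasch model, for a response x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (wrong) or 1 (right)")
    return math.exp(x * (alpha - delta)) / (1.0 + math.exp(alpha - delta))


# Hypothetical difficulties of the same item in two subgroups.
delta_group_1 = 0.0
delta_group_2 = 0.5

# Two equally able test takers (same alpha), one from each subgroup.
alpha = 1.0
p_group_1 = rasch_probability(1, alpha, delta_group_1)
p_group_2 = rasch_probability(1, alpha, delta_group_2)

# With unequal difficulties the two probabilities differ: the item is biased.
print(p_group_1, p_group_2)
```

Because the two test takers have the same ability parameter, any difference between the two probabilities is due to the item parameter alone, which is the definition of item bias used in the text.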

Firstly, unlike in the Birnbaum model, if in the Rasch model item A
has a larger item parameter value than item B, the probability of
getting a correct solution on item A is always smaller than the
probability of getting a correct solution on item B, regardless of
the examinee's ability level. Consequently, if the data fit the
Rasch model, it makes sense to assert that item A is more difficult
than item B. The item parameter value may therefore justifiably be
interpreted as the item's difficulty (Rasch, 1966a), so that
differences in item parameters between different subgroups can be
interpreted as differences in item difficulty between subgroups.
The dependence on the subgroups of the item parameters can then be
analyzed to make a diagnosis of the item's flaws necessary to
improve the item.

Secondly, in item bias detection studies we are interested in
invariance of item parameters over subgroups and not in the
individual person parameter values within each subgroup. It is
therefore a desirable property of the Rasch model that the item
parameters are inferentially separable from the person parameters.
The Rasch model is an exponential family model wherein the simple
number right score $T = X_1 + \ldots + X_k$ is a sufficient statistic for
the person parameter $\alpha$. Assuming local independence of the item
responses for a given value of $\alpha$ and after conditioning on the
number right score taking the value t, the joint
probability $P(X_1 = x_1, \ldots, X_k = x_k \mid T = t)$ of the item responses
$X_1, \ldots, X_k$ for a given score T = t becomes (Rasch, 1966b):

(2)  $P(X_1 = x_1, \ldots, X_k = x_k \mid T = t) = \exp(-x_1\delta_1 - \ldots - x_k\delta_k) / \sum_{x_1 + \ldots + x_k = t} \exp(-x_1\delta_1 - \ldots - x_k\delta_k)$.

By conditioning on the score, the nuisance parameter $\alpha$ has vanished
(Rasch, 1966b). In this paper the invariance over subgroups i
(i = 1, ..., m) of the joint item response distributions for given
values of T,

(3)  $P_i(X_1 = x_1, \ldots, X_k = x_k \mid T = t) = P(X_1 = x_1, \ldots, X_k = x_k \mid T = t)$,

is tested to study item bias. According to model (2), any deviation
from this invariance must be explained by differences in item
difficulty between the subgroups. Note from (2) that the use of the
Rasch model to study item bias is both an observed score method and
a latent-trait-model method.
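The disappearance of the person parameter under conditioning can be checked numerically. The sketch below is a minimal illustration, not the paper's software: it assumes local independence and the Rasch form of the item probabilities, uses hypothetical item difficulties, and compares the conditional probability obtained by brute-force enumeration with the closed form of model (2). All function names are assumptions made for this example.

```python
import itertools
import math


def joint_probability(pattern, alpha, deltas):
    """Joint probability of a response pattern given alpha (local independence)."""
    prob = 1.0
    for x, delta in zip(pattern, deltas):
        prob *= math.exp(x * (alpha - delta)) / (1.0 + math.exp(alpha - delta))
    return prob


def patterns_with_score(k, t):
    """All 0/1 response patterns of length k with number right score t."""
    return [p for p in itertools.product((0, 1), repeat=k) if sum(p) == t]


def conditional_by_enumeration(pattern, alpha, deltas):
    """P(pattern | T = t), computed from the joint probabilities for one alpha."""
    same_score = patterns_with_score(len(deltas), sum(pattern))
    total = sum(joint_probability(q, alpha, deltas) for q in same_score)
    return joint_probability(pattern, alpha, deltas) / total


def conditional_by_formula(pattern, deltas):
    """P(pattern | T = t) from the closed form, which does not involve alpha."""
    def kernel(q):
        return math.exp(-sum(x * d for x, d in zip(q, deltas)))
    same_score = patterns_with_score(len(deltas), sum(pattern))
    return kernel(pattern) / sum(kernel(q) for q in same_score)


deltas = [-0.5, 0.0, 0.8]   # hypothetical item difficulties
pattern = (1, 0, 1)         # a response pattern with score t = 2
for alpha in (-2.0, 0.0, 3.0):
    print(alpha, conditional_by_enumeration(pattern, alpha, deltas),
          conditional_by_formula(pattern, deltas))
```

The enumeration is exponential in the number of items, so it is only meant for small k; it serves to show that the conditional probability is the same for every value of alpha.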

Thirdly, the conditional Rasch model can be formulated
(Kelderman, 1984) as a quasi-loglinear contingency table model
(Fienberg, 1972; Bishop, Fienberg & Holland, 1975). Model (2) is
then equivalent to the hypothesis that the item responses and the
score are quasi independent (Goodman, 1968) in the incomplete score
x item 1 x ... x item k contingency table. Incomplete table
methodology can be used to formulate several hypotheses about item
bias by specifying alternative quasi-loglinear models that contain
various subgroup-dependent parameters. Testing the conditional
Rasch model against such models yields a test of the hypotheses.

Quasi-Loglinear Models for the
Incomplete Subgroup x Score x Item 1 x ... x Item k Table.

Let $f_{itx_1 \ldots x_k}$ be the number of individuals from subgroup i
(i = 1, ..., m) with number right score T = t (t = 0, 1, ..., k) and item
scores $X_1 = x_1, \ldots, X_k = x_k$, where $x_j = 1$ if item j (j = 1, ..., k) is
answered correctly and $x_j = 0$ if item j is answered incorrectly.
Since it is logically impossible to have a test score that is
unequal to the number of correct item responses (excluding counting
errors), the counts $f_{itx_1 \ldots x_k}$ are zero for $t \neq \sum_j x_j$. Table 1
shows the subgroup x score x item 1 x ... x item 3 contingency table for
subgroup i. Dashes denote cells that are logically or structurally
zero. Contingency tables with structurally zero cells are
called incomplete contingency tables.

table 1
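To make the structure of the incomplete table concrete, the following sketch tallies response records into the subgroup x score x item 1 x ... x item k layout and lists the cells that are not structurally zero. It is a minimal illustration under assumed input conventions (each record is a subgroup label plus a tuple of 0/1 item scores); the function names and the toy records are hypothetical.

```python
from collections import Counter
from itertools import product


def build_incomplete_table(records, n_items):
    """Count individuals per cell (subgroup, score, x_1, ..., x_k).

    The score is derived from the item responses, so only cells with
    score == sum of item responses can ever receive a count.
    """
    counts = Counter()
    for subgroup, responses in records:
        responses = tuple(responses)
        if len(responses) != n_items or any(x not in (0, 1) for x in responses):
            raise ValueError("each pattern must hold n_items values of 0 or 1")
        counts[(subgroup, sum(responses)) + responses] += 1
    return counts


def structurally_nonzero_cells(subgroups, n_items):
    """All cells (i, t, x_1, ..., x_k) with t equal to x_1 + ... + x_k."""
    return [(i, sum(x)) + x
            for i in subgroups
            for x in product((0, 1), repeat=n_items)]


# Toy data: two subgroups, three items.
records = [(1, (1, 0, 1)), (1, (1, 0, 1)), (2, (0, 0, 0)), (2, (1, 1, 1))]
table = build_incomplete_table(records, n_items=3)
cells = structurally_nonzero_cells(subgroups=(1, 2), n_items=3)

# The complete cross-classification would have 2 * 4 * 8 = 64 cells;
# only 2 * 8 = 16 of them are structurally nonzero.
print(len(cells), table[(1, 2, 1, 0, 1)])
```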

Fienberg (1972; see also Bishop, Fienberg & Holland, 1975)
presents a general theory for the statistical analysis of
incomplete multiway contingency tables by quasi-loglinear models.
We apply Fienberg's theory to the analysis of the subgroup x score x
item 1 x ... x item k contingency table to detect item bias.

Let $m_{itx_1 \ldots x_k}$ be the expected counts for the table under some
model. If $t \neq x_1 + \ldots + x_k$, the expected counts are again
structurally zero. If $t = x_1 + \ldots + x_k$, the expected counts are
structurally nonzero and these counts are explained by a
quasi-loglinear model. The saturated or fully specified model for the
table is:

(4)  $\ln m_{itx_1 \ldots x_k} = u + u_1(i) + u_2(t) + u_3(x_1) + \ldots + u_{k+2}(x_k)$
     $+ u_{12}(it) + u_{13}(ix_1) + \ldots + u_{(k+1)(k+2)}(x_{k-1}x_k)$
     $+ u_{123}(itx_1) + \ldots + u_{123 \ldots (k+2)}(itx_1 \ldots x_k)$,

for i = 1, ..., m; $x_1 = 0, 1$; ...; $x_k = 0, 1$; $t = x_1 + \ldots + x_k$,

where ln is the natural logarithm. Model (4) has constraints:

(5)  $u_1(+) = u_2(+) = \ldots = u_{k+2}(+) = u_{12}(+t) = u_{12}(i+)$
     $= u_{13}(+x_1) = u_{13}(i+) = \ldots = u_{(k+1)(k+2)}(+x_k)$
     $= u_{(k+1)(k+2)}(x_{k-1}+) = \ldots = u_{123}(+tx_1) = u_{123}(i+x_1)$
     $= u_{123}(it+) = \ldots = u_{123 \ldots (k+2)}(+tx_1 \ldots x_k)$
     $= u_{123 \ldots (k+2)}(i+x_1 \ldots x_k) = \ldots$
     $= u_{123 \ldots (k+2)}(itx_1 \ldots x_{k-1}+) = 0$.
The u-terms in model (4) describe main effects and interaction
effects of subgroup i, score t and item responses $x_j$. The
u-terms in expression (5) denote sums of parameters that occur in
model (4), where a plus sign replacing an index indicates that the
summation is over the replaced index. The constraints (5), however,
are not sufficient to ensure that all parameters in model (4) are
estimable. Additional constraints must be imposed to obtain a
unique solution of the model parameters. These constraints will be
discussed later.

Restrictive quasi-loglinear models are defined by setting
u-terms in (4) equal to zero. The only models considered here will be
hierarchical, i.e. whenever a particular u-term is set to zero, all
its higher order relatives must also be set to zero.



The Rasch Model as a Quasi-Loglinear Model.



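A restrictive quasi-loglinear model is

(6)  $\ln m_{itx_1 \ldots x_k} = u + u_1(i) + u_2(t) + u_{12}(it) + u_3(x_1) + \ldots + u_{k+2}(x_k)$,

with the constraints

(7)  $u_1(+) = u_2(+) = u_{12}(+t) = u_{12}(i+) = u_3(+) = \ldots = u_{k+2}(+) = 0$.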
Model (6) can be obtained from the saturated quasi-loglinear
model (4) by setting all interactions with and between item
responses equal to zero.

If the subgroup and score are taken as fixed variables and the
item responses are considered as random variables, model (6) is
equivalent to the conditional Rasch model. In that case $m_{itx_1 \ldots x_k}$
is the conditional expected frequency of the response
$X_1 = x_1, \ldots, X_k = x_k$ for given subgroup i and score t. The conditional
probability of response $X_1 = x_1, \ldots, X_k = x_k$ for given i and t can
then be obtained from (6) by

(8)  $P(X_1 = x_1, \ldots, X_k = x_k \mid i, t) = m_{itx_1 \ldots x_k} / \sum_{x_1 + \ldots + x_k = t} m_{itx_1 \ldots x_k}$
     $= \exp\{u_3(x_1) + \ldots + u_{k+2}(x_k)\} / \sum_{x_1 + \ldots + x_k = t} \exp\{u_3(x_1) + \ldots + u_{k+2}(x_k)\}$.



Except for a reparametrization, model (8) is equivalent to model
(2). In model (2) the effect $-x_j\delta_j$ of a response $X_j = x_j$ on item j
is $-\delta_j$ for a correct response ($x_j = 1$) and zero for an incorrect
response ($x_j = 0$), whereas in model (8) the effect of a correct
response is $u_{j+2}(1)$ and the effect of an incorrect response is
$u_{j+2}(0)$, where $u_{j+2}(0) = -u_{j+2}(1)$ by the constraints (7). Model (8)
can be parametrized in the same way as model (2) if $u_{j+2}(1)$ is
added to each parameter $u_{j+2}(x_j)$, so that the parameter
$u_{j+2}(x_j) + u_{j+2}(1)$ becomes $u_{j+2}(0) + u_{j+2}(1) = 0$ with an incorrect
response and $2u_{j+2}(1)$ with a correct response. This can be done by
multiplying both numerator and denominator by

$\exp\{u_3(1) + \ldots + u_{k+2}(1)\}$,

so that model (8) becomes model (2) with

$-\delta_j = 2u_{j+2}(1)$

for all j = 1, ..., k; i.e. $\delta_j = 2u_{j+2}(0)$. This shows that the
Rasch model is equivalent to the quasi-loglinear model (6).

In model (6) there is an obvious overparameterization because of
the linear dependence of the item responses and the score: adding a
constant c to each of the item parameters $u_{j+2}(1)$ (j = 1, ..., k) and
subtracting c from $u_{j+2}(0)$ (j = 1, ..., k) to satisfy the constraints
(7) is equivalent to adding
$t \cdot c - (k - t) \cdot c = (2t - k) \cdot c$ to $u_2(t)$. This indeterminacy can be
removed from model (2) by putting one linear constraint on the item
parameters, e.g. by setting $u_{k+2}(x_k)$ equal to zero.
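The size of this indeterminacy can be verified by enumeration. The sketch below is an illustrative check with hypothetical parameter values: it shifts the item parameters for a correct response by +c and those for an incorrect response by -c, and confirms that the sum of item effects in every response pattern with score t changes by exactly (2t - k)c, an amount that depends on the pattern only through t and can therefore be absorbed by the score parameter.

```python
from itertools import product


def max_shift_error(u_correct, c):
    """Largest deviation between the actual change in the summed item
    effects and the claimed change (2t - k) * c, over all patterns."""
    k = len(u_correct)
    # Sum-to-zero constraint per item: effect of x = 0 is minus that of x = 1.
    u_incorrect = [-u for u in u_correct]
    worst = 0.0
    for pattern in product((0, 1), repeat=k):
        t = sum(pattern)
        original = sum(u_correct[j] if x else u_incorrect[j]
                       for j, x in enumerate(pattern))
        shifted = sum((u_correct[j] + c) if x else (u_incorrect[j] - c)
                      for j, x in enumerate(pattern))
        worst = max(worst, abs((shifted - original) - (2 * t - k) * c))
    return worst


# Hypothetical item parameters for k = 4 items and an arbitrary shift c.
error = max_shift_error([0.2, -0.1, 0.4, -0.5], c=0.3)
print(error)  # zero up to floating-point rounding
```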

We now describe less restrictive quasi-loglinear models that can
be used to detect item bias.

Quasi-Loglinear Models to Detect Item Bias.

To study item bias in a particular set of data, quasi-loglinear
models may be set up that contain subgroup-dependent item
parameters in addition to the parameters of the Rasch model (Rasch,
1960). The fit of these models can be compared by a likelihood
ratio test with the fit of more restrictive models to test the
significance of each of the subgroup-dependent item parameters. If
a test yields a significant result, the item is biased. The
subgroup-dependent item parameters each describe a particular type
of item bias.

To detect the simplest type of bias, e.g. in item one, the model

(9)  $\ln m_{itx_1 \ldots x_k} = u + u_1(i) + u_2(t) + u_{12}(it)$
     $+ u_3(x_1) + \ldots + u_{k+2}(x_k) + u_{13}(ix_1)$,

with the usual constraints (5), is compared with the loglinear
Rasch model (6) to test the null hypothesis that the interaction
between the subgroup and the response to item one, $u_{13}(ix_1)$, is
zero. If the test is significant, it may be concluded that $u_{13}(ix_1)$
is not zero, so that the difficulty of item one varies from subgroup
to subgroup. The parameter $u_{13}(ix_1)$ is the change of item easiness
in subgroup i and $u_3(x_1) + u_{13}(ix_1)$ is the easiness of item one in
subgroup i.

In model (9) a u-term is specified to test item bias for only
one item. Obviously, similar u-terms can be specified for two or
more items if necessary. For example, comparing the loglinear Rasch
model with the model

(10)  $\ln m_{itx_1 \ldots x_k} = u + u_1(i) + u_2(t) + u_{12}(it) + u_3(x_1) + \ldots$
      $+ u_{k+2}(x_k) + u_{13}(ix_1) + u_{14}(ix_2)$

yields a simultaneous statistical test for bias in both item one
and item two.

An item may be more difficult in one subgroup than another
because the item introduces some specific difficulty, e.g. reading
ability, in which the members of one subgroup are generally more
proficient than the members of another. If the ability to solve
this difficulty varies from individual to individual within each of
the subgroups and if there are two items in the test that both
introduce the same difficulty, we may expect these items to show an
interaction that is not explained by the original latent trait.

This interaction may be investigated using the model

(11)  $\ln m_{itx_1 \ldots x_k} = u + u_1(i) + u_2(t) + u_{12}(it) + u_3(x_1)$
      $+ \ldots + u_{k+2}(x_k) + u_{13}(ix_1) + u_{14}(ix_2)$
      $+ u_{34}(x_1x_2) + u_{134}(ix_1x_2)$,

which contains two u-terms, $u_{34}(x_1x_2)$ and $u_{134}(ix_1x_2)$, describing an
interaction between item one and two. If $u_{134}(ix_1x_2)$ is zero but
$u_{34}(x_1x_2)$ is not zero, there is a simple interaction between both
items that is the same in all subgroups. If $u_{134}(ix_1x_2)$ is not
zero, the interaction is different from subgroup to subgroup. This
may, for example, be the case if reading ability does introduce
common variance in one subgroup but does not introduce any
variance in another subgroup, because the individuals in that
subgroup are all of superior reading ability.

Comparing model (11) with the loglinear Rasch model (6) yields a
test for the hypothesis that all subgroup-dependent item parameters
in model (11) are simultaneously zero. If the test is significant,
it may be concluded that one or more of these parameters are not
zero. Comparing model (11) with model (10) yields a test for the
item interaction terms alone. To test both item interaction terms
$u_{34}(x_1x_2)$ and $u_{134}(ix_1x_2)$ separately, an intermediate submodel must
be defined that contains $u_{34}(x_1x_2)$ but not $u_{134}(ix_1x_2)$.



table 2,3 



Table 2 lists all relevant models (a. through e.) containing
subgroup-dependent item parameters for the case of two items. Table
3 summarizes which models in Table 2 must be compared to test
specific subgroup-dependent item parameters. Hypotheses 3 and 4 show
which models must be compared to test $u_{34}(x_1x_2)$ and $u_{134}(ix_1x_2)$
respectively.

Hypotheses 1-4 in Table 3 refer to what Mellenbergh (1982) has
called 'uniform' item bias. It means that the bias is constant
within each subgroup. With 'nonuniform' item bias (Mellenbergh,
1982) the bias in each subgroup is dependent on the individual's
ability level. Nonuniform bias may be studied with quasi-loglinear
models containing item parameters that depend both on the subgroup
and the score.
Table 2 shows a series of models (f. through m.) with subgroup-
and score-dependent item parameters. Since quasi-loglinear models
are hierarchical, each model with a subgroup x score x item(s)
interaction term must contain the corresponding subgroup x item(s)
interaction term. In Table 2 all models f through m contain a
submodel from models a through e, which is indicated by its letter
for brevity. Table 3 shows which of these models must be compared
to obtain a statistical test that is sensitive to a specific type
of nonuniform item bias. Note that these tests concentrate only on
the nonuniformity of the bias and not on the uniform part of the
bias. Therefore, if these tests are not significant, items may
still be uniformly biased.

Hypothesis 5 in Table 3 concerns the simplest type of
nonuniformity in item bias. If model (9) and model f (Table 2) differ
significantly, it can be concluded that the subgroup x score x item
interaction $u_{123}(itx_1)$ is not zero. This nonuniformity in item bias
may be expected, for example, if the difficulty of an item varies
from subgroup to subgroup for low ability individuals only, which
is the case if an item involves a specific skill that is not
mastered by the low ability individuals of only one of the
subgroups.

Hypothesis 6 (Table 3) concerns this hypothesis for two items
simultaneously, whereas hypotheses 7 and 8 address the question
whether item interaction is nonuniform ($u_{234}(tx_1x_2) \neq 0$) or whether
subgroup differences in item interaction are nonuniform
($u_{1234}(itx_1x_2) \neq 0$). This may be called nonuniform common item bias,
where the amount of item bias that two items have in common depends
on ability level. This type of item bias may occur, for example, if
in only one subgroup two items introduce a common difficulty for
low ability individuals but do not introduce a common difficulty
for high ability subjects.

In most of the models in Table 2, the constraints are not
sufficient to ensure identifiability of the model parameters. For
example, the parameter $u_{23}(tx_1)$ with t = 0 and $x_1 = 1$, or t = k and $x_1 = 0$,
cannot be estimated because it corresponds to structurally zero
cells only. A convenient way to determine the number of estimable
parameters is to determine the rank of the information matrix,
which should be equal to the number of estimable parameters for a
given set of data (cf. McHugh, 1956; Goodman, 1974). Baker and
Nelder (1978, sec. 4.3) describe a weighted least-squares algorithm
for the analysis of contingency tables, which estimates the
parameters in a sequential fashion. If a parameter is linearly
dependent on the preceding parameters, or if there are no
observations to estimate it from, the parameter is removed from the
model, so that the information matrix is of full rank.
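The example of a score x item parameter that corresponds to structurally zero cells only can be reproduced by listing which combinations of score and item response can occur at all. The sketch below is a small illustration (the function names are hypothetical) and does not compute the rank of the information matrix; it only enumerates the (t, x) combinations that appear in at least one structurally nonzero cell.

```python
from itertools import product


def observable_score_response_pairs(k, item=0):
    """(score, response) pairs for one item that occur in some pattern
    of k binary items; `item` is a zero-based item index."""
    pairs = set()
    for pattern in product((0, 1), repeat=k):
        pairs.add((sum(pattern), pattern[item]))
    return sorted(pairs)


def unobservable_score_response_pairs(k, item=0):
    """(score, response) pairs that fall in structurally zero cells only."""
    observed = set(observable_score_response_pairs(k, item))
    return [(t, x) for t in range(k + 1) for x in (0, 1)
            if (t, x) not in observed]


# For k = 3 items, a score of 0 with a correct response and a score of 3
# with an incorrect response cannot occur.
print(unobservable_score_response_pairs(3))
```

For any k the two impossible combinations are (0, 1) and (k, 0), in line with the example given in the text.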



Estimation and Testing 



The kernel of the log likelihood is

(12)  $\ln \prod_{i,t,x_1,\ldots,x_k} m_{itx_1 \ldots x_k}^{f_{itx_1 \ldots x_k}} = \sum_{i,t,x_1,\ldots,x_k} f_{itx_1 \ldots x_k} \ln m_{itx_1 \ldots x_k}$.

Inserting a loglinear model for $\ln m_{itx_1 \ldots x_k}$ in the likelihood
yields a sum of products of model parameters (e.g. $u_3(x_1)$) with the
corresponding sufficient marginal counts (e.g. $f_{++x_1+ \ldots +}$). For
example, using the loglinear Rasch model (6) in (12) gives

(13)  $l(\mathrm{Rasch}) = f_{++ \ldots +}\, u + \sum_i f_{i+ \ldots +}\, u_1(i) + \sum_t f_{+t+ \ldots +}\, u_2(t)$
      $+ \sum_i \sum_t f_{it+ \ldots +}\, u_{12}(it) + \sum_{x_1} f_{++x_1+ \ldots +}\, u_3(x_1) + \ldots$
      $+ \sum_{x_k} f_{++ \ldots +x_k}\, u_{k+2}(x_k)$,

where a plus sign replacing an index denotes summation over that
index.

Log likelihoods of larger models (e.g. model (9)) may be obtained by
adding terms (e.g. $\sum_i \sum_{x_1} f_{i+x_1+ \ldots +}\, u_{13}(ix_1)$). If a model,
say model M, is a special case of another model, say model M*,
model M* may be tested against model M by -2 times the natural
logarithm of the likelihood ratio of both models, or equivalently,
by -2 times the difference in log likelihood of both models:

(14)  $G^2(M; M^*) = -2\{l(M) - l(M^*)\}$.
Under the assumption of model M, $G^2(M; M^*)$ is asymptotically distributed
as chi-square with degrees of freedom equal to the difference in the
number of estimable parameters of both models (Bishop, Fienberg & Holland,
1975, p. 525; Rao, 1965, p. 351).

An overall goodness of fit test for model M is obtained by
testing it against the saturated model M*, where the expected
cell counts (m) in (12) are set equal to the observed cell counts
(f).
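For the overall goodness of fit test, the log likelihood kernel of the saturated model is the sum of f ln f, so that -2 times the difference in log likelihood reduces to 2 times the sum of f ln(f / m) over the structurally nonzero cells. The following sketch computes this statistic from observed and fitted counts. It is a minimal illustration with made-up counts; the function name is an assumption, and the fitted counts would in practice come from an estimation program.

```python
import math


def g_squared(observed, expected):
    """Likelihood-ratio statistic G^2 = 2 * sum(f * ln(f / m)) of a model
    against the saturated model, over structurally nonzero cells.

    Cells with an observed count of zero contribute nothing to the sum.
    """
    total = 0.0
    for f, m in zip(observed, expected):
        if f < 0 or m <= 0:
            raise ValueError("need f >= 0 and m > 0 in every nonzero cell")
        if f > 0:
            total += f * math.log(f / m)
    return 2.0 * total


# Made-up observed and fitted counts for two cells with equal totals.
statistic = g_squared([10, 20], [15, 15])
print(statistic)
```

A perfect fit (fitted counts equal to observed counts) gives a statistic of zero, and larger values indicate a worse fit.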

For example, the Rasch model (6) is a special case of model (9).
Model (9) has all parameters of the Rasch model but adds the term
$u_{13}(ix_1)$. Testing model (6) against model (9) is a test for the
hypothesis $u_{13}(ix_1) = 0$. If the parameter estimates of both model
(6) and (9) are known, the likelihood-ratio statistic $G^2(M; M^*)$ can
be calculated easily from the sufficient marginal sums
corresponding to the parameters.

Maximum-likelihood estimates of the model parameters can be
obtained by setting the observed marginal counts corresponding to
each of the parameters equal to the corresponding expected marginal
counts and solving the resulting system of equations for the
parameters (Haberman, 1979, p. 448). For example, for the Rasch
model the maximum-likelihood equations are

(15)  $f_{it+ \ldots +} = m_{it+ \ldots +}$ and $f_{++ \ldots x_j \ldots +} = m_{++ \ldots x_j \ldots +}$

for i = 1, ..., m; t = 0, ..., k
and $x_j = 0, 1$; j = 1, ..., k.
In general, for quasi-loglinear models, the maximum-likelihood
equations yield no direct solution of the model parameters. The
equations must be solved iteratively. Algorithms to solve the
maximum-likelihood equations for quasi-loglinear models have been
described by Goodman and (1974: ECTA) and Baker and Nelder
(1978: GLIM). Kelderman (1983) describes a generalisation to
multiway tables of an algorithm by Goodman (1964, 1968) that
calculates the parameters of quasi-loglinear models without setting
up the entire incomplete contingency table, so that the memory space
required can be modest even if the number of items is not small.

An Example.

Kok (1982) studied item bias in multiplication items by
experimentally varying the test takers' skill in bias factors that
can be expected to be operating in differently formulated test
items. In this section, some of these data are reanalyzed to
illustrate the use of quasi-loglinear models for the detection of
item bias.



table 4



Table 4 shows the contents of six multiplication items. In items
1 through 4 the numbers are written out in Dutch and in items 5 and
6 Roman numerals are used. The subjects were 286 Dutch
undergraduates, of whom 144 randomly selected individuals received
a short training in Roman numerals. It can be expected that the
Roman items are biased.



table 5 



In Table 5 the values of the likelihood-ratio test and the degrees
of freedom are shown for each item, for both uniform bias
(hypothesis 1, Table 3) and nonuniform bias (hypothesis 5, Table 3).
From Table 5 it is seen that item 5 and item 6 are uniformly biased.
There is no nonuniform bias in this set of data. Since both item 5
and item 6 are written in Roman numerals, we would expect both items
to be biased by a common bias factor. To test this, hypotheses 3 and
4 of Table 3 are tested. Neither showed a significant result
($G^2(c;d) = 0.2$, DF = 1; $G^2(d;e) = 1.4$, DF = 1). We can,
therefore, conclude that item 5 and item 6 are uniformly biased but
not that the bias factors of both items are the same.
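
The significance markers attached to the uniform-bias statistics in Table 5 can be checked against the chi-square distribution. The following is a minimal sketch, assuming one degree of freedom per test (one added parameter $u_{13}(i x_1)$ for two subgroups); that DF value and the helper name `chi2_sf_df1` are assumptions of the sketch, not part of the original analysis.

```python
import math

def chi2_sf_df1(g2: float) -> float:
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    # For DF = 1, the statistic is the square of a standard normal deviate,
    # so P(chi2 > g2) = P(|Z| > sqrt(g2)) = erfc(sqrt(g2 / 2)).
    return math.erfc(math.sqrt(g2 / 2.0))

# G^2(Rasch; a) values for items 1-6 as given in Table 5.
uniform_g2 = {1: 1.7, 2: 2.4, 3: 3.2, 4: 3.5, 5: 4.8, 6: 9.9}

p_values = {item: chi2_sf_df1(g2) for item, g2 in uniform_g2.items()}
flagged = [item for item, p in p_values.items() if p < 0.05]

for item, p in p_values.items():
    print(f"item {item}: G2 = {uniform_g2[item]:.1f}, p = {p:.4f}")
print("items significant at .05:", flagged)
```

Under this assumption only the two Roman-numeral items fall below the .05 level, in line with the asterisks in Table 5.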

The model with both item 5 and item 6 uniformly biased (i.e. (10))
gives a good fit to the data ($G^2 = 106.8$, DF = 107). The estimates
for the item parameters $u_4(x_2)$ through $u_8(x_6)$ are 0.36, 0.40,
-0.51, 0.05 and 0.03 respectively for $x = 1$, where the first item
parameter is fixed at zero. The subgroup x item response parameters
$u_{17}(i x_5)$ and $u_{18}(i x_6)$ for $x = 1$ and $i = 1$, the group
that received a training in Roman numerals, are 0.21 and .27
respectively. That is, the items are much easier for the group that
received the training in Roman numerals.

Item Bias Detection when Subgroups are Unknown

When subgroup membership is unobserved the subgroup variable
becomes a latent variable. The models to detect item bias then
become latent-class models. For example, if the latent classes are
denoted by $\omega$ ($\omega = 1, \ldots, m$), the latent-class
version of model (9) becomes

(16)  $\ln m_{\omega t x_1 \ldots x_k} = u + u_1(\omega) + u_2(t) + u_{12}(\omega t) + u_3(x_1) + \ldots + u_{k+2}(x_k) + u_{13}(\omega x_1)$,

$\omega = 1, \ldots, m$; $x_1 = 0, 1$; ...; $x_k = 0, 1$;
$t = x_1 + \ldots + x_k$; with the usual constraints (5).

Model (16) describes a Rasch model in each latent class where
the difficulty of item 1 may be different in each latent class. The
parameter $u_{13}(\omega x_1)$ describes the differences in item
difficulty between the latent classes. If this parameter is not
zero, item 1 is biased with respect to the latent classes.
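
To make the structure of model (16) concrete, the sketch below generates the expected counts for a small case with $k = 3$ items and $m = 2$ latent classes. All parameter values are hypothetical and chosen only to satisfy the usual sum-to-zero constraints; the function name `expected_counts` is likewise an assumption. Only cells with $t = x_1 + \ldots + x_k$ are generated, the remaining cells being structural zeros.

```python
import math
from itertools import product

def expected_counts(u, u1, u2, u12, item, u13):
    """Expected counts m[w, t, x] under model (16) for 0/1 item responses."""
    k = len(item)
    m = {}
    for w in range(len(u1)):
        for x in product((0, 1), repeat=k):
            t = sum(x)  # the score is determined by the response pattern
            log_m = u + u1[w] + u2[t] + u12[w][t] + u13[w][x[0]]
            log_m += sum(item[j][x[j]] for j in range(k))
            m[w, t, x] = math.exp(log_m)
    return m

# Hypothetical parameter values (not estimates from the paper).
u = 1.0
u1 = [0.2, -0.2]                                       # u_1(w)
u2 = [0.0, 0.1, -0.1, 0.0]                             # u_2(t), t = 0..3
u12 = [[0.1, -0.1, 0.1, -0.1], [-0.1, 0.1, -0.1, 0.1]]  # u_12(w t)
item = [[0.0, 0.0], [-0.2, 0.2], [0.25, -0.25]]        # u_3(x_1)..u_5(x_3)
u13 = [[-0.3, 0.3], [0.3, -0.3]]                       # u_13(w x_1)

m = expected_counts(u, u1, u2, u12, item, u13)

def log_odds_item1_vs_item2(w):
    """Within class w and score 1: log odds of solving item 1 rather than item 2."""
    return math.log(m[w, 1, (1, 0, 0)] / m[w, 1, (0, 1, 0)])

shift = log_odds_item1_vs_item2(0) - log_odds_item1_vs_item2(1)
print(f"cells generated: {len(m)}, class difference in log odds: {shift:.2f}")
```

The class difference in the conditional log odds depends only on $u_{13}(\omega x_1)$, which is the sense in which that parameter expresses bias of item 1 with respect to the latent classes.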

Latent-class models have been introduced by Lazarsfeld (1950;
Lazarsfeld & Henry, 1968; Goodman, 1978). At first, latent-class
models assumed local independence within each latent class. Goodman
(1975) introduced latent-class models where the observed variables
form an incomplete contingency table, assuming quasi-independence
within each latent class. Finally, Haberman (1979, ch. 10)
formulates a latent-class model for an incomplete table where the
model is not necessarily an independence model. The model can be
any identifiable loglinear model containing unobserved categorical
variables. Model (16) is a special case of Haberman's general
latent-class model where item 1 may have a different difficulty in
each of $m$ latent classes, where the number $m$ of latent classes
is specified by the investigator. Not all latent-class versions of
the models to detect item bias (Table 2) make sense, since
parameters involving the latent-class variable may be wholly
absorbed by lower order parameters involving observed categorical
variables only. These latent-class parameters are then redundant and
not identifiable. This holds true for most models for nonuniform
item bias.

For example, consider the latent-class version of model g, Table 2:

(17)  $\ln m_{\omega t x_1 \ldots x_k} = u + u_1(\omega) + u_2(t) + u_{12}(\omega t) + u_3(x_1) + \ldots + u_{k+2}(x_k) + u_{13}(\omega x_1) + u_{23}(t x_1) + u_{123}(\omega t x_1)$.

The expected values of the observed counts $(t, x_1, \ldots, x_k)$
are then

$m_{+ t x_1 \ldots x_k} = \exp\{u + u_2(t) + u_3(x_1) + \ldots + u_{k+2}(x_k) + u_{23}(t x_1) + g_1(t x_1)\}$,

where

$g_1(t x_1) = \ln \sum_{\omega} \exp\{u_1(\omega) + u_{12}(\omega t) + u_{13}(\omega x_1) + u_{123}(\omega t x_1)\}$.

Now $g_1(t x_1)$ can be completely absorbed by $u$, $u_2(t)$,
$u_3(x_1)$ and $u_{23}(t x_1)$ to obtain new parameters using the
following reparametrisation:

$u^* = u + g_1(++)$,

$u^*_2(t) = u_2(t) + g_1(t+) - g_1(++)$,

$u^*_3(x_1) = u_3(x_1) + g_1(+x_1) - g_1(++)$,

$u^*_{23}(t x_1) = u_{23}(t x_1) + g_1(t x_1) - g_1(t+) - g_1(+x_1) + g_1(++)$,

where the notation $g_1(t+)$ is used to denote an average over the
subscripts replaced by a plus sign. This shows that the latent-class
terms in model (17) are redundant. Consequently there is no
latent-class version of test 5 of Table 3. A similar argument holds
for test 8; the latent-class term $u_{1234}(\omega t x_1 x_2)$ is
absorbed by its lower order relatives involving the observed
variables $t$, $x_1$ and $x_2$.
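
The absorption argument can be checked numerically. The sketch below draws arbitrary (hypothetical) latent-class parameters, computes $g_1(t x_1)$, and verifies that the four increments of the reparametrisation reproduce it exactly. As a simplification of the sketch, $t$ and $x_1$ are treated as a full grid, ignoring the structurally impossible combinations of the incomplete table.

```python
import math
import random

def g1(u1, u12, u13, u123, t, x):
    """g_1(t x_1): log of the sum over classes of exp{u_1 + u_12 + u_13 + u_123}."""
    return math.log(sum(
        math.exp(u1[w] + u12[w][t] + u13[w][x] + u123[w][t][x])
        for w in range(len(u1))))

random.seed(1)
n_classes, scores, responses = 2, range(4), range(2)

def rnd():
    return random.uniform(-1.0, 1.0)  # arbitrary, hypothetical values

u1 = [rnd() for _ in range(n_classes)]
u12 = [[rnd() for _ in scores] for _ in range(n_classes)]
u13 = [[rnd() for _ in responses] for _ in range(n_classes)]
u123 = [[[rnd() for _ in responses] for _ in scores] for _ in range(n_classes)]

g = {(t, x): g1(u1, u12, u13, u123, t, x) for t in scores for x in responses}

# Averages over the index replaced by a plus sign.
g_pp = sum(g.values()) / len(g)
g_tp = {t: sum(g[t, x] for x in responses) / len(responses) for t in scores}
g_px = {x: sum(g[t, x] for t in scores) / len(scores) for x in responses}

# Increments added to u, u_2(t), u_3(x_1) and u_23(t x_1).
d_u = g_pp
d_u2 = {t: g_tp[t] - g_pp for t in scores}
d_u3 = {x: g_px[x] - g_pp for x in responses}
d_u23 = {(t, x): g[t, x] - g_tp[t] - g_px[x] + g_pp
         for t in scores for x in responses}

max_error = max(abs(d_u + d_u2[t] + d_u3[x] + d_u23[t, x] - g[t, x])
                for t, x in g)
print(f"largest reconstruction error: {max_error:.2e}")
```

Because the observed part of model (17) already contains $u_{23}(t x_1)$, every function of $t$ and $x_1$ can be taken up by existing terms, whatever the latent-class parameters are.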

In the latent-class models for detecting uniform bias the
latent-class parameters are not absorbed. For example, the
latent-class version of the model used to test one-item uniform bias
(16) yields the expected values of the observed counts:

(18a)  $m_{+ t x_1 \ldots x_k} = \exp\{u + u_2(t) + u_3(x_1) + \ldots + u_{k+2}(x_k) + g_2(t x_1)\}$,

where

(18b)  $g_2(t x_1) = \ln \sum_{\omega} \exp\{u_1(\omega) + u_{12}(\omega t) + u_{13}(\omega x_1)\}$,

so that if we set

(19a)  $u^* = u + g_2(++)$,

       $u^*_2(t) = u_2(t) + g_2(t+) - g_2(++)$,

       $u^*_3(x_1) = u_3(x_1) + g_2(+x_1) - g_2(++)$,

       $u^*_{23}(t x_1) = g_2(t x_1) - g_2(t+) - g_2(+x_1) + g_2(++)$,

the model

$\ln m_{+ t x_1 \ldots x_k} = u^* + u^*_2(t) + u^*_3(x_1) + u_4(x_2) + \ldots + u_{k+2}(x_k) + u^*_{23}(t x_1)$

satisfies the usual constraints (5).

In model (18) the term $g_2(t x_1)$ is not absorbed by lower order
terms. The corresponding term $u^*_{23}(t x_1)$ describes a specific
interaction between the test score and item 1. From (18b) it can be
seen that this parameter arises both from differences in item
difficulty over latent classes ($u_{13}(\omega x_1)$) and from
differences in test score distribution over latent classes
($u_{12}(\omega t)$). If either of these effects is zero,
$g_2(t x_1)$ becomes constant over one index, so that from (19a)
$u^*_{23}(t x_1)$ becomes zero. For example, if $u_{13}(\omega x_1)$
becomes zero, $g_2(t x_1)$ no longer depends on $x_1$, so that
$g_2(t x_1) = g_2(t x_1')$ for all $x_1' \neq x_1$. Consequently
$g_2(+x_1) = g_2(++)$ and $g_2(t+) = g_2(t x_1)$ for all $x_1$, so
that $u^*_{23}(t x_1)$ becomes zero.
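
The behaviour of $u^*_{23}(t x_1)$ described here can be illustrated numerically. The sketch below evaluates (18b) and the last line of (19a) for hypothetical parameter values with two latent classes, three score levels and one binary item; the function names and all values are assumptions made for illustration.

```python
import math

def interaction(u1, u12, u13):
    """u*_23(t x_1) from (19a), with g_2(t x_1) computed as in (18b)."""
    n_w, n_t, n_x = len(u1), len(u12[0]), len(u13[0])
    g = [[math.log(sum(math.exp(u1[w] + u12[w][t] + u13[w][x])
                       for w in range(n_w)))
          for x in range(n_x)] for t in range(n_t)]
    g_pp = sum(map(sum, g)) / (n_t * n_x)
    g_tp = [sum(g[t]) / n_x for t in range(n_t)]
    g_px = [sum(g[t][x] for t in range(n_t)) / n_t for x in range(n_x)]
    return [[g[t][x] - g_tp[t] - g_px[x] + g_pp for x in range(n_x)]
            for t in range(n_t)]

def max_abs(table):
    return max(abs(v) for row in table for v in row)

# Hypothetical values satisfying the usual sum-to-zero constraints.
u1 = [0.3, -0.3]
u12 = [[-0.5, 0.0, 0.5], [0.5, 0.0, -0.5]]   # score distribution differs by class
u13 = [[-0.4, 0.4], [0.4, -0.4]]             # item 1 difficulty differs by class
zero12 = [[0.0] * 3 for _ in range(2)]
zero13 = [[0.0] * 2 for _ in range(2)]

both = max_abs(interaction(u1, u12, u13))
no_item_difference = max_abs(interaction(u1, u12, zero13))
no_score_difference = max_abs(interaction(u1, zero12, u13))
print(both, no_item_difference, no_score_difference)
```

Only the first configuration, in which both $u_{12}(\omega t)$ and $u_{13}(\omega x_1)$ are nonzero, yields a score x item interaction that differs from zero beyond rounding error.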

If $u^*_{23}(t x_1)$ is nonzero, the item characteristic curve
(ICC) of item 1 deviates from the ICC predicted by the Rasch model.
This means that deviations of the ICCs of a certain item may be
explained as item bias of that item with respect to unknown
subgroups. Introducing latent classes may provide an alternative to
introducing additional item parameters as in the two- and
three-parameter logistic test models.

The latent-class versions of the remaining models for detecting
uniform bias (models c-e, Table 2) also contain non-redundant
latent-class terms. Writing the models for the expected values of
the counts, the latent-class parameters of models c-e similarly
produce terms $u_{234}(t x_1 x_2)$ and lower order relatives that
are not already specified in the observed part of the model. This
means that score-dependent item interaction may result from
differences in item difficulty or differences in item interaction
between latent subgroups.

Methods for the estimation and testing of latent-class
quasi-loglinear models differ from those for ordinary
quasi-loglinear models. Since latent-class membership is
unobserved, the frequencies $f_{\omega t x_1 \ldots x_k}$ are not
known. Consequently, the maximum-likelihood equations (e.g.
$f_{\omega + x_1 + \ldots +} = \hat m_{\omega + x_1 + \ldots +}$)
for parameters involving the latent classes $\omega$ (e.g.
$u_{13}(\omega x_1)$) cannot be solved because the frequencies are
unknown. Haberman (1979, ch. 10), however, gives a rule for the
derivation of maximum-likelihood estimates in latent-class models
from the known frequencies $f_{+ t x_1 \ldots x_k}$: "The same
maximum-likelihood equations apply as in the ordinary case in which
all frequency counts are directly observed, except that the
unobserved counts are replaced by their estimated conditional
expected values given the observed marginal totals". Under some
loglinear model M (e.g. model (16)), these estimates are

(20)  $\hat f_{\omega t x_1 \ldots x_k} = E_M(f_{\omega t x_1 \ldots x_k} \mid f_{+ t x_1 \ldots x_k})$,  $t = x_1 + \ldots + x_k$.

For model (16) the likelihood equations would then become

(21)  $\hat f_{\omega t + \ldots +} = \hat m_{\omega t + \ldots +}$,
      $f_{+ + x_1 + \ldots +} = \hat m_{+ + x_1 + \ldots +}$, ...,
      $f_{+ + + \ldots + x_k} = \hat m_{+ + + \ldots + x_k}$ and
      $\hat f_{\omega + x_1 + \ldots +} = \hat m_{\omega + x_1 + \ldots +}$.

The estimated counts $\hat f$ are obtained from (20), where the
$\hat m$ are described by model (16). A scoring algorithm to solve
these equations has been described by Haberman (1979, p. 556). An
alternative way to solve these equations is to use the E-M
algorithm (Dempster, Laird & Rubin, 1977) with (20) as the
expectation step and (21) as the maximization step.
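
The expectation step can be sketched in a few lines. Given the total in an observed cell, the counts are spread over the latent classes in proportion to the current fitted values, so that (20) takes the form $\hat f_{\omega t x_1 \ldots x_k} = f_{+ t x_1 \ldots x_k} \, \hat m_{\omega t x_1 \ldots x_k} / \hat m_{+ t x_1 \ldots x_k}$. The function name `e_step` and all numbers below are hypothetical; the maximization step (refitting the loglinear model to the completed table) is not shown.

```python
def e_step(m, f_obs):
    """Allocate observed counts to latent classes in proportion to fitted counts.

    m[w][cell]  : current fitted count for latent class w and observed cell
    f_obs[cell] : observed count for that cell, summed over latent classes
    """
    completed = [dict() for _ in m]
    for cell, count in f_obs.items():
        total = sum(m_w[cell] for m_w in m)  # fitted count summed over classes
        for w, m_w in enumerate(m):
            completed[w][cell] = count * m_w[cell] / total
    return completed

# Hypothetical fitted counts for two latent classes and k = 2 items; cells are
# response patterns (x_1, x_2), the score t being the sum of the pattern.
m = [
    {(0, 0): 6.0, (1, 0): 6.0, (0, 1): 3.0, (1, 1): 5.0},
    {(0, 0): 2.0, (1, 0): 2.0, (0, 1): 9.0, (1, 1): 5.0},
]
f_obs = {(0, 0): 10, (1, 0): 16, (0, 1): 8, (1, 1): 6}

f_hat = e_step(m, f_obs)
for cell in f_obs:
    print(cell, [round(f_hat[w][cell], 2) for w in range(len(m))])
```

The completed counts always add up to the observed counts over the latent classes, so each cycle leaves the observed table itself unchanged.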

Discussion 

In this paper an item bias detection method is proposed that
uses a Rasch latent trait as an internal criterion for ability.
Latent trait parameters are removed from the model by conditioning
on the number-right score, and the quasi-loglinear formulation of
the model is extended with parameters that describe different types
of item bias. The general theory of (quasi-)loglinear models is
used to obtain maximum-likelihood parameter estimates and
likelihood-ratio tests.

Using Haberman's (1979) latent-class generalisation of
quasi-loglinear models it is shown that even if subgroup membership
is unknown it is still possible to determine whether different
individuals with the same ability level have different
probabilities of a correct response on a certain item.

It is also shown that nonzero item bias parameters with respect
to latent classes can alternatively be modelled as parameters that
describe deviations of item difficulty in different score groups.
This means that the item characteristic curve of that item deviates
from the item characteristic curve predicted by the Rasch model.
Consequently, at least part of the structure in the item responses
that is explained by slope parameters in the Birnbaum model may be
explained as item bias. Since item bias can be interpreted as
multidimensionality, item-specific slope parameters may partly be
explained as multidimensionality of the item response.

The models presented in this paper have two parts: one part
contains parameters describing item bias, the other part contains
parameters for the Rasch measurement model. It may be objected that
the Rasch model is too restrictive a model for the measurement part 
and that a less restrictive, possibly multidimensional model, is 
preferred. Two remarks in favour of the Rasch model are in order 
here. 

Firstly, as was seen before, there is a trade-off between the 
complexity of the item bias part of the model and the measurement 
part of the model. A more complex measurement model, e.g. a model 
with slope parameters for the item characteristic curve, may hinder 
the identification of certain types of item bias. Therefore if 
identification of item bias is the objective and nothing is known 
about the right (possibly multidimensional) measurement model, a 
simple measurement model is to be preferred. Unlike many other item
bias detection methods a check of the adequacy of the item bias 
detection model is available because the overall fit of the model 
can be tested by a chi-square test. 

Secondly, in general it is more desirable to construct
unidimensional than multidimensional test items because the
interpretation of the responses is less ambiguous. Even if a
multidimensional test or item bank is needed to cover a certain
content domain it is better to construct a number of homogeneous
subsets of items. In that case the models presented in this paper
can be applied to short subtests. Obviously, it is more probable
that short subtests fit the Rasch model than that long subtests do.
For one item the Rasch model is trivially true.

Item bias detection methods using an internal ability criterion
assume that a good measure of this criterion is available, i.e.
that the items used to measure this criterion fit the measurement
model. If that is not the case, particularly if one or more of
these items are biased themselves, the results may be erroneous.
Marco (Lord, 1980, p. 228) proposed a procedure to purify a test of
biased items. The total test is analyzed, items that appear to be
biased are removed, and the remaining items are used as an internal
ability criterion to test the bias of all the test items one by one.
Although this procedure does not escape the inherent circularity of
the problem, it should suffice if not too many items are biased.
This procedure can also be used with the tests presented in this
paper, where in the first phase only one-item uniform bias is tested
and in the second cycle the set of unbiased items is combined with
pairs of possibly biased items to use the diagnostic tests
presented in this paper.
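
The two-phase purification logic can be outlined as follows. This is a minimal sketch under stated assumptions: `purify` and `toy_statistic` are hypothetical names, and the toy statistic merely stands in for a likelihood-ratio test such as $G^2$(Rasch; a); its values are invented for illustration and are not results. The critical value 3.84 is the .05 point of the chi-square distribution with one degree of freedom.

```python
def purify(items, bias_statistic, critical_value):
    """Screen all items, then retest each one against the unsuspected items."""
    items = list(items)
    # Phase 1: test each item with all remaining items as ability criterion.
    suspected = {i for i in items
                 if bias_statistic(i, [j for j in items if j != i]) > critical_value}
    # Phase 2: retest every item with only the unsuspected items as criterion.
    criterion = [j for j in items if j not in suspected]
    return {i for i in items
            if bias_statistic(i, [j for j in criterion if j != i]) > critical_value}

# Hypothetical stand-in for a test statistic: items 5 and 6 carry bias, and
# every biased item left in the criterion inflates the statistic somewhat.
BASE = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 6.0, 6: 9.0}

def toy_statistic(item, criterion):
    contamination = sum(1.0 for j in criterion if BASE[j] > 3.84)
    return BASE[item] + contamination

biased = purify(range(1, 7), toy_statistic, critical_value=3.84)
print(sorted(biased))
```

In an application the stand-in would be replaced by fitting the relevant pair of quasi-loglinear models to the items in the criterion plus the item under study.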

Finally, it should be remarked that the item bias part of the
models may be more elaborate. The models in this paper contain
parameters that indicate deviations due to item bias. Kok and
Mellenbergh (1985) go further and formulate models that describe
the actual processes involved in the genesis of item bias more
precisely. Our models may be used to give directions as to which of
Kok's models may be appropriate.

Table 1

Frequency Counts and Structural Zeros in the Subgroup 1 x Score x
Item 1 x ... x Item 3 Table.

  Item Response                        Score t
  x_1  x_2  x_3        0            1            2            3

   0    0    0     f_{10000}        -            -            -
   1    0    0         -        f_{11100}        -            -
   0    1    0         -        f_{11010}        -            -
   0    0    1         -        f_{11001}        -            -
   1    1    0         -            -        f_{12110}        -
   1    0    1         -            -        f_{12101}        -
   0    1    1         -            -        f_{12011}        -
   1    1    1         -            -            -        f_{13111}

Note. Dashes denote structurally zero cells.

Table 2

Quasi-loglinear Models for Detecting Item Bias.

Models with Subgroup-Dependent Item Parameters

a. Rasch + $u_{13}(i x_1)$
b. Rasch + $u_{14}(i x_2)$
c. Rasch + $u_{13}(i x_1)$ + $u_{14}(i x_2)$
d. Rasch + $u_{13}(i x_1)$ + $u_{14}(i x_2)$ + $u_{34}(x_1 x_2)$
e. Rasch + $u_{13}(i x_1)$ + $u_{14}(i x_2)$ + $u_{34}(x_1 x_2)$ + $u_{134}(i x_1 x_2)$

Models with Subgroup- and Score-Dependent Item Parameters

f. (a) + $u_{23}(t x_1)$
g. (a) + $u_{23}(t x_1)$ + $u_{123}(i t x_1)$
h. (b) + $u_{24}(t x_2)$
i. (b) + $u_{24}(t x_2)$ + $u_{124}(i t x_2)$
j. (c) + $u_{23}(t x_1)$ + $u_{24}(t x_2)$
k. (c) + $u_{23}(t x_1)$ + $u_{24}(t x_2)$ + $u_{123}(i t x_1)$ + $u_{124}(i t x_2)$
l. (d) + $u_{23}(t x_1)$ + $u_{24}(t x_2)$ + $u_{123}(i t x_1)$ + $u_{124}(i t x_2)$
       + $u_{234}(t x_1 x_2)$
m. (e) + $u_{23}(t x_1)$ + $u_{24}(t x_2)$ + $u_{123}(i t x_1)$ + $u_{124}(i t x_2)$
       + $u_{234}(t x_1 x_2)$ + $u_{1234}(i t x_1 x_2)$

Table 3

Comparison of Quasi-loglinear Models to Test u-terms for Item Bias
Hypotheses.

     Hypothesis                               Model Terms                              Comparison
                                                                                       of Models

Uniform Bias

 1.  One item uniformly biased                $u_{13}(i x_1)$                          Rasch - a
 2.  Two items uniformly biased               $u_{13}(i x_1)$, $u_{14}(i x_2)$         Rasch - c
 3.  Two items with common uniform bias       $u_{34}(x_1 x_2)$                        c - d
 4.  Two items with common uniform bias:      $u_{134}(i x_1 x_2)$                     d - e
     subgroup dependent interaction

Nonuniform Bias

 5.  One item nonuniformly biased             $u_{123}(i t x_1)$                       f - g
 6.  Two items nonuniformly biased            $u_{123}(i t x_1)$, $u_{124}(i t x_2)$   j - k
 7.  Two items with common nonuniform bias    $u_{234}(t x_1 x_2)$                     k - l
 8.  Two items with common nonuniform bias:   $u_{1234}(i t x_1 x_2)$                  l - m
     subgroup dependent interaction

Table 4

Multiplication Items in Dutch and Roman Numerals (from Kok, 1982)

Item   Multiplication   Contents

1      7 x 1214         zeven x twaalfhonderdveertien
2      16 x 21          zestien x eenentwintig
3      16 x 14          zestien x veertien
4      6 x 4123         zes x eenenveertighonderddrieëntwintig
5      8 x 214          VIII x CCXIV
6      5 x 1318         V x MCCCXXVIII

Table 5

Likelihood-ratio Tests for Uniform and Nonuniform Item Bias.

        Uniform Bias             Nonuniform Bias
Item    G^2(Rasch; a)    DF      G^2(f; g)    DF

1       1.7                      0.9
2       2.4                      3.2
3       3.2                      0.8
4       3.5                      3.5
5       4.8*                     4.0
6       9.9**                    3.5

*  p < .05
** p < .005